In this document, we conduct a systematic review of the literature on the short-term health effects of air pollution. The objective is two-fold:
Our paper partly aims to evaluate whether the literature on the short-term health effects of air pollution suffers from power and bias issues. For years, the empirical economic literature has been preoccupied with unbiasedness. Researchers have developed empirical methods and techniques in a quest to retrieve unbiased estimates. Yet, due to their inherent constraints, these techniques limit the set of settings in which one can retrieve causal and unbiased estimates. They also often focus on limited subsets of the data, which reduces the sample size and can ultimately lead to underpowered studies. As underlined by Gelman and Carlin (2014), a lack of power is often associated with type M (magnitude) and type S (sign) errors: conditional on being statistically significant, an estimate from an underpowered study tends to exaggerate the magnitude of the true effect and may even have the wrong sign. In a quest for unbiasedness, practitioners thus use methods that lead to smaller sample sizes, which can ultimately create bias (type M and type S errors). We therefore analyse whether the literature on the short-term health effects of air pollution suffers from power, type M and type S error issues.
In this particular document, we focus mainly on the epidemiology literature. We take advantage of a somewhat standardized reporting format to retrieve estimates and confidence intervals from abstracts. This enables us to carry out design calculations.
In this section, we implement robustness checks to compute the power, type M and type S errors of the studied articles. We look at what the power, type M and type S errors would be if the true effect were a fraction of the measured effect. The estimates and confidence intervals of the articles in the literature of interest were retrieved in another document. Before turning to the power analysis itself, we look at the characteristics of the articles considered.
We retrieved the articles using the following query:
‘TITLE((“air pollution” OR “air quality” OR “particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide”) AND (“emergency” OR “mortality”) AND NOT (“long term” or “long-term”)) AND (“particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide”)’
This query returns 1649 articles. Based on the abstracts, we can briefly explore the main (unsurprising) themes of the articles:
Out of all abstracts returned by the query, 700 display confidence intervals:
In these articles, we retrieve valid effects and confidence intervals in the following proportions [1]:
| Effect retrieved | Number of articles | Proportion |
|---|---|---|
| Yes | 599 | 0.8557143 |
| No | 101 | 0.1442857 |
This corresponds to 1880 valid effects and associated confidence intervals.
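To make the retrieval step more concrete, here is a minimal Python sketch of how an estimate and its 95% confidence interval could be pulled out of an abstract with a regular expression. It is only an illustration: the pattern `CI_PATTERN`, the helper `extract_effects` and the validity rule (the estimate must lie inside its interval) are our assumptions, not the method actually used to build the dataset.

```python
import re

# Hypothetical pattern: an estimate followed by a bracketed 95% CI, e.g.
# "1.05% (95% CI: 0.32, 1.78)" or "0.40 (95% CI 0.10-0.70)".
NUM = r"-?\d+(?:\.\d+)?"
CI_PATTERN = re.compile(
    rf"(?P<est>{NUM})\s*%?\s*[\(\[]\s*95\s*%\s*(?:CI|confidence interval)\s*[:,=]?\s*"
    rf"(?P<low>{NUM})\s*%?\s*(?:-|–|to|,)\s*(?P<high>{NUM})\s*%?\s*[\)\]]",
    re.IGNORECASE,
)


def extract_effects(abstract):
    """Return a list of (estimate, lower bound, upper bound) tuples."""
    effects = []
    for match in CI_PATTERN.finditer(abstract):
        est = float(match.group("est"))
        low = float(match.group("low"))
        high = float(match.group("high"))
        # Assumed validity rule: keep only estimates lying inside their interval.
        if low <= est <= high:
            effects.append((est, low, high))
    return effects


example = (
    "A 10 ug/m3 increase in PM2.5 was associated with a 1.05% "
    "(95% CI: 0.32, 1.78) increase in mortality; for ozone the "
    "estimate was 0.40 (95% CI 0.10-0.70)."
)
print(extract_effects(example))
```

An abstract that mentions “95% CI” without reporting any numbers yields an empty list with this pattern, which is one way an article can display the string “CI” and still provide no valid effect.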
Here is a random example of the effects and confidence intervals detected by our method (highlighted in gray):
In this subsection, we investigate whether there are systematic differences between articles displaying an effect that we detected in the abstract and articles that do not display an effect or for which we did not detect the effect.
We first examine whether there are disparities in publication dates. Displaying effects in the abstract might have been a feature of a given period.
Even though there are slightly more recent (2010-2020) articles for which effects are retrieved, the difference does not seem to be substantial.
We also investigate whether there are differences in the journals in which the articles are published.
For this analysis to be informative, we would need to cluster the journals into groups (e.g., epidemiology journals, general science journals, etc.).
Then, we examine whether the themes considered in each type of abstract differ.
Apart from a few key terms, such as “CI” and “95”, there are no large variations in the themes.
We do not seem to detect effects more often for one pollutant than for another. Note that if an article considers several pollutants, it will appear several times in this graph.
Now that we have quickly compared the articles for which we retrieve an effect and those for which we do not, we can dig further into the analysis of the estimates retrieved.
In this section, we briefly analyse the effects retrieved. First, we look into the proportion of effects which are significant.
| Significant | Number of effects | Proportion |
|---|---|---|
| No | 90 | 0.0478723 |
| Yes | 1790 | 0.9521277 |
Unsurprisingly, most of the effects retrieved here are significant. These effects are reported in the abstracts and with confidence intervals.
We then look into the distribution of the t-scores.
There seems to be some bunching of t-scores just above 1.96. In this analysis, we only consider estimates reported in the abstracts. Authors may only report significant estimates in their abstracts even though they also report non-significant estimates in the body of the article. This might explain the bunching. We need to investigate this further to understand whether the bunching is evidence of publication bias. One way to do so would be to reproduce the present analysis on the full texts and not only on the abstracts.
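For reference, a t-score can be recovered from an estimate and its confidence interval along the following lines. This is a minimal sketch under assumptions that are ours: the interval is a symmetric 95% interval based on a normal approximation, and the null value is zero on the scale at which the effect is reported. The function names are hypothetical.

```python
from statistics import NormalDist

# Critical value of a two-sided 95% interval (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)


def standard_error(low, high, z=Z_95):
    """Standard error implied by a symmetric confidence interval."""
    return (high - low) / (2 * z)


def t_score(est, low, high):
    """Ratio of the point estimate to its implied standard error."""
    return est / standard_error(low, high)


def is_significant(est, low, high):
    """Significant at the 5% level when |t| exceeds the critical value."""
    return abs(t_score(est, low, high)) > Z_95


print(round(t_score(1.05, 0.32, 1.78), 2), is_significant(1.05, 0.32, 1.78))
print(round(t_score(0.2, -0.3, 0.7), 2), is_significant(0.2, -0.3, 0.7))
```

For effects reported as ratios (relative risks, odds ratios), the same computation would have to be carried out on the log scale with a null value of one; the sketch does not handle that case.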
We then plot the distribution of the signal to noise ratio, i.e., the ratio of the point estimate to the width of the confidence interval.
The graph is of course analogous to the previous one. It however shows that in a large share of the studies, the magnitude of the noise is larger than the magnitude of the effect. Looking in more detail at the distribution of the signal to noise ratio, we notice that for nearly 40% of the estimates considered here, the magnitude of the noise is larger than that of the signal.
| Quantile | Signal to noise ratio |
|---|---|
| 0% | 0.0322581 |
| 10% | 0.5384542 |
| 20% | 0.6564957 |
| 30% | 0.8215482 |
| 40% | 1.0241936 |
| 50% | 1.3454994 |
| 60% | 2.2152047 |
| 70% | 4.6094737 |
| 80% | 9.8214508 |
| 90% | 23.8192248 |
| 100% | 834.8333333 |
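The signal to noise ratio and the share of noise-dominated estimates can be computed as in the short Python sketch below. The data are made up for illustration and the function names are hypothetical. Under the additional assumption of a symmetric 95% interval, the t-score equals roughly 3.92 times this ratio, which is why the two distributions look alike.

```python
from statistics import quantiles


def signal_to_noise(est, low, high):
    """Absolute point estimate divided by the width of its confidence interval."""
    return abs(est) / (high - low)


def share_noise_dominated(ratios):
    """Share of estimates whose interval is wider than the estimate itself."""
    return sum(r < 1 for r in ratios) / len(ratios)


# Illustrative (estimate, lower bound, upper bound) triples, not real data.
effects = [(1.05, 0.32, 1.78), (0.40, 0.10, 0.70), (2.0, 1.2, 2.8), (5.0, 4.5, 5.5)]
ratios = [signal_to_noise(*effect) for effect in effects]

deciles = quantiles(ratios, n=10, method="inclusive")  # nine cut points
print(deciles)
print(share_noise_dominated(ratios))
```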
We then turn to the power analysis itself. We aim at evaluating the power, type M and type S errors for each estimate. To compute these values, we would need to know the true effect size. Yet, in general, we do not know what the true effect is. It would be particularly challenging to retrieve exactly what is measured in each analysis since there is no standardized way of reporting the results. A study may for instance claim that a 10 \(\mu g/m^{3}\) increase in PM2.5 concentration leads to an increase of x% in hospital admissions over the course of a year, while another study may state that a 2% increase in ozone concentration increases the number of deaths by 3 over a month. Thus, as suggested by Gelman and Carlin (2014), we consider different potential “true” effect sizes. We run robustness checks to investigate what the power, type M and type S errors would be if the true effects were only a fraction of the measured effects. The results are thus only indicative. There is no reason to think a priori that a given effect would be overestimated. Yet, if assuming that the true effect is 3/4 of the measured effect implies that a significant estimate is expected to overestimate the true effect by a factor of 2, there might be a substantial issue with this estimate.
To do so, we use the package retrodesign, which computes post-analysis design calculations (power, type M and type S errors). We run the function retro_design() for several effect sizes.
In a first part, we carry out our analysis on the whole set of articles. We notice that there is some heterogeneity across articles, with some articles displaying high power and others displaying lower power. Thus, in a second part, we look in more detail at articles displaying low power.
We start by computing the average and median power, type M and type S errors.
| True effect (fraction of measured effect) | Power (mean) | Power (median) | Type M (mean) | Type M (median) | Type S (mean) | Type S (median) |
|---|---|---|---|---|---|---|
| 0.01 | 0.1031186 | 0.0503187 | 56.366388 | 44.340189 | 0.3400676 | 0.4386578 |
| 0.05 | 0.2496683 | 0.0580046 | 11.426730 | 8.937214 | 0.1908713 | 0.2255850 |
| 0.10 | 0.3391992 | 0.0824306 | 5.873157 | 4.548757 | 0.1097782 | 0.0780555 |
| 0.33 | 0.5459089 | 0.4132677 | 2.104172 | 1.541507 | 0.0144482 | 0.0002604 |
| 0.50 | 0.6602571 | 0.7508618 | 1.585378 | 1.160191 | 0.0055088 | 0.0000029 |
| 0.67 | 0.7545065 | 0.9422312 | 1.349583 | 1.034729 | 0.0029794 | 0.0000000 |
| 0.75 | 0.7913018 | 0.9770161 | 1.281254 | 1.014091 | 0.0024007 | 0.0000000 |
| 0.90 | 0.8478207 | 0.9973378 | 1.193230 | 1.001735 | 0.0017218 | 0.0000000 |
| 1.00 | 0.8774069 | 0.9995402 | 1.153525 | 1.000312 | 0.0014286 | 0.0000000 |
Then, we explore graphically the distribution of power, type M and type S errors across estimates and for different sizes of the true effect.
A large share of articles display high power and low rates of type M and type S errors in each robustness check. However, a non-negligible number of articles display lower power and/or some evidence of type M error. Type S error does not seem to be an important issue here. We investigate potential causes of the lack of power and of the type M errors further in the next subsection.
Note that for type M errors, due to some outliers, we used a log scale. Without the log scale, and restricting our sample to type M errors lower than 2.5 (95% of our sample, even when the true effect is assumed to be one third of the measured effect), the distribution looks as follows:
We find that, even if the measured effect is the true effect, there is some risk of type M error.
Then, we look at how type M and type S errors evolve with power in the estimates considered.
There is a one-to-one relationship between power and type M and type S error. Not surprisingly, type M and type S error skyrocket in studies with low power.
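Under the normal approximation of Gelman and Carlin (2014), this relationship can be written explicitly. The notation below is ours and is only a sketch: \(\beta\) is the assumed true effect, \(\sigma\) the standard error of the estimate, \(\Phi\) and \(\phi\) the standard normal distribution and density functions, and \(z = \Phi^{-1}(1 - \alpha/2)\) the critical value (about 1.96 for \(\alpha = 0.05\)).

\[
\lambda = \frac{|\beta|}{\sigma}, \qquad
\text{power}(\lambda) = \Phi(\lambda - z) + \Phi(-\lambda - z), \qquad
\text{type S}(\lambda) = \frac{\Phi(-\lambda - z)}{\text{power}(\lambda)}
\]

\[
\text{type M}(\lambda) = \frac{\phi(z - \lambda) + \phi(z + \lambda) + \lambda \left[ \Phi(\lambda - z) - \Phi(-\lambda - z) \right]}{\lambda \, \text{power}(\lambda)}
\]

For a given \(\alpha\), all three quantities depend on the data only through \(\lambda\), and power is strictly increasing in \(\lambda\). Each value of power therefore corresponds to exactly one value of type S and one value of type M.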
We then investigate how average power, type M and type S errors evolve with the true effect size, expressed as a proportion of the measured effect.
Type M and type S errors skyrocket, and power collapses, for small values of the true effect (as a proportion of the measured effect). In addition, if for each paper of the literature the true effect were only three quarters of the measured effect, average power would be lower than the usual 80%. Type S error only seems to be an issue for small values of the true effect as a proportion of the measured effect. Type M error seems to be more consistently problematic. The sharp increase in the previous graph makes it difficult to read the values of type M error when the true effect is not a small proportion of the measured effect. We therefore zoom in.
We notice that, on average in the literature, the treatment effects are overestimated, even for large values of the true effect. This result might be driven by some outliers. We therefore look into the evolution of the median type M error with true effect size.
We notice that the issue is much less important when looking at the median. This suggests some heterogeneity in terms of power in the literature.
It might also be interesting to look at how power, type M and type S errors evolved over time, i.e., with publication date.
There does not seem to be a clear trend in the evolution of power and type S error. However, type M error seems to have peaked in the 2010s and to be decreasing again recently.
In the previous section, we noticed that a non-negligible number of studies seem to suffer from low power and the associated type M error. We consider that an estimate has low power if its power is lower than 80% when the true effect is 3/4 of the measured effect. 80% is the threshold usually used in power analyses, but 3/4 is arbitrary and could easily be changed in a robustness check. Following this criterion, the number and proportion of estimates with low power are as follows:
| Power | Number of estimates | Proportion |
|---|---|---|
| Adequate power | 1186 | 0.6308511 |
| Low power | 694 | 0.3691489 |
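The classification rule can be sketched in Python as follows. The function names are hypothetical, and we assume a symmetric 95% confidence interval and a normally distributed estimate; the 3/4 fraction and the 80% threshold are the ones stated above.

```python
from statistics import NormalDist

_STD = NormalDist()
Z_CI = _STD.inv_cdf(0.975)  # critical value of the reported 95% interval


def power_at_fraction(est, low, high, fraction=0.75, alpha=0.05):
    """Power of the design if the true effect is `fraction` of the estimate."""
    z = _STD.inv_cdf(1 - alpha / 2)
    se = (high - low) / (2 * Z_CI)   # standard error implied by the interval
    lam = fraction * abs(est) / se   # assumed true effect in standard errors
    return (1 - _STD.cdf(z - lam)) + _STD.cdf(-z - lam)


def classify(est, low, high, threshold=0.80):
    """Label an estimate according to the low power criterion."""
    power = power_at_fraction(est, low, high)
    return "Low power" if power < threshold else "Adequate power"


print(classify(1.05, 0.32, 1.78))  # a t-score of about 2.8
print(classify(2.0, 1.2, 2.8))     # a t-score of about 4.9
```

With these assumptions, an estimate needs a t-score of roughly 3.7 or more to be classified as adequately powered, since 0.75 times the t-score must exceed 1.96 + 0.84.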
We investigate the particularities of the articles with low power. We start by reproducing the analyses used to compare articles for which we retrieved an effect and those for which we did not. First, we look into the distribution of publication dates.
It seems that fewer articles with low power have been published recently, in comparison to articles with adequate power. This is consistent with our previous finding. We then look into the distribution of articles across journals.
Interestingly, some journals, such as “Science of the Total Environment”, the “International Journal of Occupational Medicine and Environmental Health”, the “Cochrane Database of Systematic Reviews”, “Environmental Science and Pollution Research” and the “Journal of Exposure Science and Environmental Epidemiology”, publish a large share of low power studies. On the contrary, BMJ Open publishes very few low power studies.
Here also, grouping the journals into broad themes could be more instructive.
We also look into disparities by pollutant.
There do not seem to be stark differences by pollutant type.
[1] Note that a number of abstracts contain the string “CI” without actually displaying effects and confidence intervals.